2016-07-10

Your turn

  • What is multivariate data?
  • What makes multivariate analysis different from univariate analysis?

Main types of plots

  • pairwise plots: explore replicates, and associations
  • porcupine plots: to explore treatments vs replicates
  • parallel coordinate plots: explore high-dimensional relationships with parallel axes

GGally

Classical scatterplot matrix (draftman's) plot. All variables are numeric, show all possible pairs.

data(flea)
ggscatmat(flea, columns = 2:4, color = "species") 

Generalized pairs plot, any combination of types of plots

data(tips, package = "reshape")
ggpairs(tips[, 1:4])

For biological data

Brown, A. and Hudson, K. (2015) Developmental profiling of gene expression in soybean trifoliate leaves and cotyledons, GSE61857.

coty <- read_delim("data/GSE61857_Cotyledon_normalized.txt.gz",
  delim="\t", col_types="cddddddddd", 
  col_names=c("ID", "C_S1_R1", "C_S1_R2", "C_S1_R3", 
  "C_S2_R1", "C_S2_R2", "C_S2_R3", "C_S3_R1", "C_S3_R2", "C_S3_R2"),
  skip=1)
coty <- as.data.frame(coty)
ggscatmat(coty, columns=2:7, alpha=0.1)

Normalized data for the first 6 samples are shown.

What we learn

  • The replicates look more like each other than the developmental stage 1 vs 2. This is great data!
  • There is a lot of variability in counts across stages, would indicate a lot of gene activity.
  • There are a few genes that have some difference from one rep to another. These could be problematic when it comes to testing. To identify these it is easier to use interactive graphics - stay tuned for this tomorrow.

Your turn

  • Read the leaf data in from Brown and Hudson, and make a scatterplot matrix. (If it is too slow on your computer, sample the number of rows, to get a glimpse of the data to start.)
  • Summarise what you learn about the data in a couple of sentences

Porcupine plots

Compare treatment vs reps. Focus on reps 1, 2 from stages 1 and 2.

sub <- coty %>% select(ID, C_S1_R1, C_S1_R2, C_S2_R1, C_S2_R2)
ggplot(sub, aes(x=C_S1_R1, xend=C_S1_R2, y=C_S2_R1, yend=C_S2_R2)) +
  geom_segment() + 
  theme(aspect.ratio = 1)

The best prospects should be shortest lines further from the X=Y.

Your turn

Make a porcupine plot for the first two reps of the first two developmental stages of the leaf data.

Parallel coordinate plots

  • Parallel coordinate plots, use parallel axes, and each observation is represented by a line tracing through the axes.
  • They are useful for getting a sense of the multivariate shape of the data. You can typically read association, clustering and outliers from the plot.
  • Can handle more variables than scatterplot matrices.

Side-by-side boxplots to par coords

ggparcoord(coty, columns=2:10, scale="globalminmax", boxplot=TRUE,
           alphaLines=0)

If the data is normalised well these should be identical boxplots, or very close to that. This is not quite what we see.

Parallel coordinates, connect the dots

ggparcoord(coty, columns=2:10, scale="globalminmax", 
           alphaLines=0.1)

Scaling choices

  • std: univariately, subtract mean and divide by standard deviation
  • robust: univariately, subtract median and divide by median absolute deviation
  • uniminmax: univariately, scale so the minimum of the variable is zero, and the maximum is one
  • globalminmax: no scaling is done; the range of the graphs is defined by the global minimum and the global maximum

Your turn

  • Make parallel coordinate plots for the 9 samples of the leaf data
  • Start with side-by-side boxplots, and then generate the lines
  • Run the different scaling options to discover how they change the appearance of the plot

Parcoords to interaction plots

Experiments are often displayed using interaction plots. These are similar to parallel coordinates, which display all the measurements, but interaction plots show one gene at a time.

Process the data, and then plot

sub <- coty[sample(1:nrow(coty), 6),] %>% 
  gather(sample, expr, -ID) %>%
  separate(sample, c("tissue", "stage", "rep")) %>%
  mutate(stage = as.numeric(substr(stage, 2, 2)))
sub.m <- sub %>% group_by(ID, stage) %>% 
  summarise(expr = mean(expr))

ggplot(sub, aes(x=stage, y=expr)) + geom_point(alpha=0.4) +
  facet_wrap(~ID, ncol=3) + 
  geom_line(data=sub.m, aes(x=stage, y=expr), colour="red")

Your turn

Re-make the previous plot, by overlaying them in one plot, instead of faceting, and colour by the ID.

With significance testing

d <- DGEList(counts = coty[,-1], 
  group = c(rep("S1", 3), rep("S2", 3), rep("S3", 3)), 
  genes = coty[,1])
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
de <- glmFit(d, design=matrix(c(rep(1, 9),
                                c(1,1,1,0,0,0,0,0,0), 
                                c(0,0,0,1,1,1,0,0,0)),
                              ncol=3, byrow=F))
results <- glmLRT(de)
top10 <- as.data.frame(topTags(results, n=12))

Plot

sub <- coty %>% 
  gather(sample, expr, -ID) %>%
  separate(sample, c("tissue", "stage", "rep")) %>%
  mutate(stage = as.numeric(substr(stage, 2, 2)))
top10 <- merge(top10, sub, by.x = "genes", by.y = "ID")
top10.m <- top10 %>% group_by(genes, stage) %>% 
  summarise(expr = mean(expr))
ggplot(top10, aes(x=stage, y=expr, colour=genes)) + 
  geom_point(alpha=0.4) + ylim(c(0,17.5)) +
  geom_line(data=top10.m, aes(x=stage, y=expr, colour=genes))

Your turn

Plot the next 12 most significant genes.

Extracting the significant genes

Just look pairwise for simplicity here.

d <- DGEList(counts = coty[,2:7], 
  group = c(rep("S1", 3), rep("S2", 3)), 
  genes = coty[,1])
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
d <- estimateTrendedDisp(d)
de <- exactTest(d, pair=c("S1", "S2"), dispersion = "trended")
sig.tab <- de$table
sig.tab$genes <- coty$ID
sig.tab <- sig.tab %>% filter(PValue < 0.01)

sig.tab <- merge(sig.tab, coty[,1:7], by.x="genes", by.y="ID")
ggscatmat(sig.tab, columns=5:10, alpha=0.1)

ggparcoord(sig.tab, columns=5:10, scale="globalminmax",
           alphaLines=0.1)

Your turn

  • Test the first two stages of the leaf data.
  • Plot the genes that have FDR less than 0.01 as a scatterplot matrix, and as parallel coordinate plot

Your turn

Make the porcupine plot for the two stages, just for rep 1 and rep 2, only the genes that have FDR less than 0.01

Resources

Share and share alike

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/ 3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.